# High-resolution image classification

Swinv2 Large Patch4 Window12 192 22k
Apache-2.0
Swin Transformer v2 is a vision Transformer that handles image classification and dense recognition tasks efficiently by using hierarchical feature maps and self-attention computed within local windows.
Image Classification, Transformers
microsoft
3,816
10
Swinv2 Small Patch4 Window8 256
Apache-2.0
Swin Transformer v2 is a vision Transformer that processes images efficiently by using hierarchical feature maps and self-attention computed within local windows.
Image Classification, Transformers
microsoft
1,836
0
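The names of the two Swin Transformer v2 entries above appear to encode patch size, attention window size, and input resolution. The sketch below assumes the usual reading of those names (for example, `Patch4 Window12 192` as 4x4-pixel patches, 12x12-token windows, and a 192x192 input) and looks only at the first stage, before any patch merging shrinks the token grid. The function name `first_stage_layout` is hypothetical, and real implementations may pad the grid rather than reject sizes that do not divide evenly.

```python
def first_stage_layout(image_size: int, patch_size: int, window_size: int) -> dict:
    """Token grid and window counts for the first stage of a windowed-attention model.

    Illustrative arithmetic only; this is not code from any library.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size              # tokens along one side
    if grid % window_size != 0:
        raise ValueError("token grid must be divisible by window_size")
    windows_per_side = grid // window_size
    tokens = grid * grid
    tokens_per_window = window_size * window_size
    windows = windows_per_side * windows_per_side
    return {
        "tokens": tokens,
        "windows": windows,
        "tokens_per_window": tokens_per_window,
        # Number of query-key pairs if every token attended to every token.
        "global_pairs": tokens * tokens,
        # Number of query-key pairs when attention stays inside each window.
        "windowed_pairs": windows * tokens_per_window * tokens_per_window,
    }


if __name__ == "__main__":
    # Assumed readings of the two entry names above.
    for label, args in [
        ("Patch4 Window12 192", (192, 4, 12)),
        ("Patch4 Window8 256", (256, 4, 8)),
    ]:
        print(label, first_stage_layout(*args))
```

Under these assumptions, restricting attention to windows reduces the number of query-key pairs by a factor equal to the number of windows, which is the efficiency the descriptions refer to.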
Cvt W24 384 22k
Apache-2.0
CvT-w24 is a vision Transformer pre-trained on ImageNet-22k and fine-tuned at 384x384 resolution. It improves on the standard vision Transformer by introducing convolutions into the architecture.
Image Classification, Transformers
microsoft
66
0
Cvt 13 384 22k
Apache-2.0
CvT-13 is a vision model that combines convolutions with a Transformer. It is pre-trained on ImageNet-22k and fine-tuned on ImageNet-1k for image classification.
Image Classification, Transformers
microsoft
508
0
Vit Large Patch32 384
Apache-2.0
This Vision Transformer (ViT) model is pre-trained on ImageNet-21k and then fine-tuned on ImageNet for image classification.
Image Classification
google
118.37k
16
Beit Large Patch16 512
Apache-2.0
BEiT is a vision Transformer-based image classification model, pre-trained in a self-supervised manner on ImageNet-21k and fine-tuned on ImageNet-1k.
Image Classification
microsoft
683
11
Vit Base Patch32 384
Apache-2.0
Vision Transformer (ViT) is an image classification model based on the Transformer architecture, pre-trained on ImageNet-21k and fine-tuned on ImageNet.
Image Classification
google
24.92k
20
Vit Base Patch16 384
Apache-2.0
Vision Transformer (ViT) is an image classification model based on the Transformer architecture, pre-trained on ImageNet-21k and fine-tuned on ImageNet.
Image Classification
google
30.30k
38
Vit Large Patch16 384
Apache-2.0
Vision Transformer (ViT) is an image classification model based on the Transformer architecture, pre-trained on ImageNet-21k and fine-tuned on ImageNet.
Image Classification
google
161.29k
12
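For the ViT and BEiT entries above, the name suggests how many patch tokens the model processes per image. The sketch below assumes `PatchP R` means PxP-pixel patches on an RxR input; the helper name `num_patches` is hypothetical. It counts image patches only, so any extra tokens a model prepends (such as a class token) are not included.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


if __name__ == "__main__":
    # (label, image_size, patch_size), read off the entry names (an assumption).
    entries = [
        ("Patch32 384", 384, 32),
        ("Patch16 384", 384, 16),
        ("Patch16 512", 512, 16),
    ]
    for label, size, patch in entries:
        print(f"{label}: {num_patches(size, patch)} patches")
```

Halving the patch size at a fixed resolution quadruples the patch count, which is why the Patch16 variants work with much longer token sequences than the Patch32 variants at 384x384.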
© 2025 AIbase